ABSTRACT


Available literature reports several lymphoma cases misdiagnosed as 
tuberculosis, especially in countries with a heavy TB burden. This frequent 
misdiagnosis is due to the fact that the two diseases can present with similar 
symptoms. The present study therefore aims to analyse and explore TB as
well as lymphoma case reports using Natural Language Processing tools and
evaluate the use of machine learning to differentiate between the two diseases.
As a starting point in the study, case reports were collected for each disease
using web scraping. Natural language processing tools and text clustering 
were then used to explore the created dataset. Finally, six machine learning 
algorithms were trained and tested on the collected data, which contained 765 
lymphoma and 546 tuberculosis case reports. Each method was evaluated 
using various performance metrics. The results indicated that the multi-layer 
perceptron model achieved the best accuracy (93.1%), recall (91.9%) and
precision score (93.7%), thus outperforming other algorithms in terms of
correctly classifying the different case reports.
This is an open access article under the CC BY-SA license. 
1. INTRODUCTION

According to the United Nations Programme on HIV/AIDS (UNAIDS), tuberculosis (TB) is the 
deadliest infectious disease worldwide [1]. It is therefore a major threat to global health. In an effort to 
combat this threat, countries like South Africa administer empiric TB treatment to patients likely to be 
suffering from the disease [2], [3]. This means that the patients receive treatment while awaiting their TB
laboratory results.

TB symptoms include fatigue, fever, dyspnea and night sweats. On radiological images, the disease 
can present as masses and fold thickenings [4]. However, these symptoms are not unique to tuberculosis,
leading to possible misdiagnoses. One disease that is often misdiagnosed as TB is lymphoma, a cancer which 
occurs when lymphocytes inside the lymph nodes multiply too fast or live too long [5], [6]. 

Many cases of lymphoma have been diagnosed as TB, as reported by [7]-[9], with the misdiagnosed 
patients receiving TB treatment while their cancer progresses. This is why the current study investigates the 
use of machine learning and natural language processing (NLP) in differentiating between the two diseases. 

Machine learning (ML) is a sub-field of artificial intelligence (AI) which aims to process data, 
identify patterns in them and learn those patterns, without being explicitly programmed to do so [10], [11]. It
has been shown to improve diagnosis prediction, as reported by [12]. NLP, on the other hand, focuses on
extracting information from unstructured texts and converting it into a format that computers can process 
[13]. It has been successfully implemented in decision support systems for areas such as risk stratification, 
symptom identification and medical diagnosis [14], [15]. 

Although generic text classifiers exist, they are generally not tuned for scientific data analysis
[16]. This is why specific NLP systems have been used to classify patients with TB since the early 1990s [17],
[18], and to differentiate between TB and other pulmonary diseases [19]. To the best of our knowledge, however,
there is no system differentiating specifically between TB and lymphoma. Hence the overall purpose of our
study is to create an NLP system to classify lymphoma and TB diagnoses. The system could serve for
screening purposes and help reduce the misdiagnosis rate between the two diseases.

In our previous paper [20], we classified the two diseases using case reports collected from 
ScienceDirect. The features in each report were extracted using TF-IDF as well as Amazon Comprehend
Medical, which is an NLP API for medical feature extraction. The current paper aims to: 1) analyse the
collected case reports using NLP and clustering, 2) explore their different characteristics, and 3) identify
documents which are not case reports of either disease using machine learning algorithms.

This will help us collect additional relevant case reports from various sources, and design a more 
robust training dataset to be used in differentiating TB and lymphoma. All algorithms in this study are
implemented using the “sklearn” Python module [21] with its default parameters and no parameter
tuning. Another limitation of this study is that it disregards the semantic value of terms when extracting them
from the collected text. The rest of this paper is organised as follows: section 2 describes the methods used
in this study, section 3 presents and discusses the results of the various experiments performed, and
section 4 concludes the paper.


2. RESEARCH METHOD 
2.1. Data collection 

Figure 1 gives a summary of the methodology applied. To create our dataset, we automatically 
scraped tuberculosis and lymphoma case reports from ScienceDirect through their search API using the
following search terms: “tuberculosis case report” and “lymphoma case report”. The case reports were
restricted based on title, as described in [20]. For each search result returned, we retrieved the full article 
using ScienceDirect’s Full-text retrieval API, then extracted the second section as the case report. This was 
achieved using a Python library called Beautiful Soup. A summary of the data collection process is shown in 
Figure 2. 
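
The extraction of the second section can be illustrated with a minimal sketch. It assumes, purely for illustration, that a retrieved article is an XML document whose body is a sequence of `section` elements; the element names and the sample article are hypothetical and do not reflect ScienceDirect's actual response format. The study used Beautiful Soup, whereas the sketch relies on Python's standard `xml.etree.ElementTree` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical article layout, NOT the real ScienceDirect schema.
SAMPLE_ARTICLE = """
<article>
  <title>Pulmonary tuberculosis: a case report</title>
  <section><heading>Introduction</heading><para>Background text.</para></section>
  <section><heading>Case report</heading><para>A 34-year-old man presented with fever.</para></section>
  <section><heading>Discussion</heading><para>Interpretation.</para></section>
</article>
"""


def extract_second_section(xml_text):
    """Return the paragraph text of the second <section>, or None if absent."""
    root = ET.fromstring(xml_text.strip())
    sections = root.findall("section")
    if len(sections) < 2:
        return None
    # Join the text of every paragraph found in the second section.
    paragraphs = [p.text.strip() for p in sections[1].iter("para") if p.text]
    return " ".join(paragraphs)


case_report = extract_second_section(SAMPLE_ARTICLE)
print(case_report)
```

Returning `None` when fewer than two sections are present makes it easy to skip articles whose layout does not match the assumption.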


2.2. Data pre-processing 
The first part of preparing the data for our machine learning algorithms was done using the “Natural
Language Toolkit” (NLTK), a Python module for NLP. This process consisted of the following steps:

— Contractions expansion; using the ‘contractions’ Python package, known shortened combinations of
words were expanded back to their original form.

— Tokenization; each document was split into a series of words. Punctuation, numbers and special 
characters were then removed and letters converted to lower case. 

— Stopwords removal; recurrent English words which convey little to no information, such as articles and
pronouns, were removed from the text. NLTK’s stopwords list was extended to also include terms such as
‘patient’, ‘disease’, ‘using’, ‘figure’, ‘fig’, ‘clinic’, ‘hospital’, ‘et’, ‘al’. These terms appeared in multiple
texts without bringing information necessary to our classification task.

— Lemmatization; using NLTK’s WordNetLemmatizer, words were reduced to their base form (lemma) in
an effort to group together similar words (e.g., plural words were converted to their singular form).
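
The four steps above can be sketched in a few lines of Python. This is only an illustration: the small dictionaries below are toy stand-ins (assumptions) for the ‘contractions’ package, NLTK’s stopwords list and the WordNetLemmatizer, and the text is lower-cased at the start for simplicity.

```python
import re

# Toy resources standing in for the 'contractions' package, NLTK's stopword
# list and WordNetLemmatizer; all three are illustrative assumptions.
CONTRACTIONS = {"didn't": "did not", "wasn't": "was not", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "it", "did", "not",
             "with", "of", "to", "in", "patient", "hospital", "fig"}
LEMMAS = {"nodes": "node", "masses": "mass", "sweats": "sweat"}


def preprocess(text):
    """Return the list of cleaned tokens for one document."""
    text = text.lower()
    # 1. Contractions expansion.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 2. Tokenization: keeping alphabetic runs drops punctuation, numbers
    #    and special characters.
    tokens = re.findall(r"[a-z]+", text)
    # 3. Stopwords removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Lemmatization by dictionary lookup (toy replacement for WordNet).
    return [LEMMAS.get(t, t) for t in tokens]


tokens = preprocess(
    "The patient didn't report night sweats; "
    "2 enlarged lymph nodes were found (Fig. 1)."
)
print(tokens)
```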

The resulting data were then converted from free text into a vector space using term frequency-inverse
document frequency (TF-IDF). Extra pre-processing consisted of extracting the age and gender of each patient. The
detailed feature extraction process is reported in [20].
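
To make the vectorisation step concrete, the following sketch computes TF-IDF weights by hand. It assumes the weighting follows the documented default of scikit-learn's `TfidfVectorizer` (raw term counts, smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each document vector); the exact settings used in the study are reported in [20], and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalised rows)."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = [["fever", "night", "sweat", "cough"],
        ["fever", "lymph", "node", "hodgkin"],
        ["cough", "sputum", "fever"]]
vocab, matrix = tfidf(docs)
for term, weight in zip(vocab, matrix[0]):
    print(f"{term:10s} {weight:.3f}")
```

A term such as "fever", which occurs in every document, receives the lowest weight within a document, while terms specific to one report receive the highest.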


2.3. Data exploration 
Using the “scikit-learn” library in Python, k-means++ clustering was applied to the vectorised 
dataset in order to group together similar case reports. The algorithm proceeds as follows [22]:
a. Choose k initial centroids: select the first centroid at random from the data points, then repeat until k
centroids have been chosen:
— For each data point, calculate the Euclidean distance to the closest centroid already chosen.
— Choose the next centroid using a probability distribution proportional to the squared Euclidean distances.
b. Repeat until the centroids stay the same between consecutive iterations:
— Assign each data entry to the cluster specified by the closest centroid.
— Compute the mean of each cluster and assign that value as the new centroid.
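
The two phases of the algorithm can be sketched in plain Python as follows. This is an illustrative implementation under simplifying assumptions (points given as tuples, a fixed random seed, a cap on the number of iterations), not the scikit-learn code used in the study; the function names and the toy points are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """Seeding: pick the next centroid with probability proportional to the
    squared distance to the closest centroid already chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```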


Figure 1. Methodology (data collection, data pre-processing, text clustering, text classification, models’ evaluation)

Figure 2. Web scraping using Beautiful Soup and Python


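The optimal number of clusters (k) was decided based on silhouette scores, which measure how
cohesive and distinguishable clusters are. The silhouette score of a data point is given by (1):

$$s = \frac{b - a}{\max(a, b)} \tag{1}$$

where a is the average distance between the data point and the other data points in the same cluster and b is the
average distance between the data point and the data points in the closest cluster. For a data point i assigned
to cluster C_i, a and b are calculated as follows:

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j) \tag{2}$$

$$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j) \tag{3}$$

where d(i, j) is the distance between data points i and j.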
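
As an illustration of how a and b combine into a score, the sketch below computes the average silhouette of a labelled dataset in plain Python. It assumes the Euclidean distance, at least two clusters, and the common convention of assigning a score of 0 to points in singleton clusters; the toy points are invented and the function name simply mirrors `sklearn.metrics.silhouette_score`.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, hence the division by n - 1).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster.
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters mixed up
print(round(good, 3), round(bad, 3))
```

Scores close to 1 indicate compact, well-separated clusters, whereas negative scores indicate points that sit closer to another cluster than to their own.
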
2.4. Text classification
We implemented the following algorithms as a benchmark: logistic regression, k-Nearest
Neighbours (kNN), decision trees, Naive Bayes, support vector machines (SVM) and an artificial neural
network (ANN) in the form of a multi-layer perceptron. Brief descriptions of the algorithms are given below:

a. Decision trees; this method returns a tree-like structure, where internal nodes perform tests based on
attribute values and each branch represents the outcome of the test. The tree ends in leaf nodes, which
are associated with the most probable decision [23]. Instances are classified by traversing the tree and
applying rules at each internal node until a leaf node is reached [24].

b. Artificial neural network; an ANN consists of layers of artificial neurons which are connected with each
other. Input data traverse the layers, which process them and output a result [25], [26]. Each neuron receives
the input data from neurons in the previous layer, and each neuron-to-neuron connection has a weight
representing its strength [25].

We used a multi-layer perceptron (MLP) with one hidden layer of 100 hidden units. This ANN
determines the input weights of each linear model as follows [27]:
— Initialize w = 0
— Go through the data points {x_i, y_i}, with labels y_i ∈ {−1, +1}
— If a data point x_i is misclassified, then w ← w − α sign(f(x_i)) x_i, which is equivalent to
w ← w + α y_i x_i, where α is the learning rate
— Repeat until all the data are correctly classified
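
The weight update for a single linear unit can be sketched as below. The sketch uses the standard perceptron update w ← w + α y_i x_i for a misclassified point, with labels in {−1, +1} and the bias folded into w through a constant feature; the toy data, the learning rate and the epoch cap are assumptions made for illustration, and the full MLP in scikit-learn is trained differently.

```python
def sign(v):
    # Treat 0 as negative so that the initial w = 0 triggers an update.
    return 1 if v > 0 else -1


def train_perceptron(X, y, alpha=1.0, max_epochs=100):
    """X: list of feature tuples, y: labels in {-1, +1}."""
    Xb = [tuple(x) + (1.0,) for x in X]  # append constant bias feature
    w = [0.0] * len(Xb[0])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            f = sum(wj * xj for wj, xj in zip(w, xi))
            if sign(f) != yi:
                # Misclassified point: move w towards the correct side.
                w = [wj + alpha * yi * xj for wj, xj in zip(w, xi)]
                errors += 1
        if errors == 0:  # every point correctly classified
            break
    return w


def predict(w, x):
    return sign(sum(wj * xj for wj, xj in zip(w, tuple(x) + (1.0,))))


X = [(2, 1), (3, 2), (2, 3), (-1, -2), (-2, -1), (-3, -2)]
y = [1, 1, 1, -1, -1, -1]
w = train_perceptron(X, y)
print(w, [predict(w, x) for x in X])
```

The loop only terminates with zero errors when the data are linearly separable, which is why the sketch also caps the number of epochs.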

c. Naive Bayes; Naive Bayes is a simple, statistics-based method, which predicts a class (Y) for a new 
example (X) based on the largest a posteriori probability, previous experience and event probability 
[28]. 

The probability of X belonging to a class c is given by the following formula:

$$P(c \mid X) = \frac{P(X \mid c)\, P(c)}{P(X)} \tag{4}$$

where:
P(c): probability of class c
P(X): probability of the predictors X
P(X|c): probability of having X features given class c
P(c|X): probability of an instance X belonging to class c given the values of its predictors [29]
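
A small worked example may help to see how the formula is applied to text. The sketch below assumes a multinomial model over token counts with add-one (Laplace) smoothing, which is one common Naive Bayes variant and not necessarily the one used in the study; since P(X) is identical for every class, it is dropped and the classes are compared through log P(c) + Σ log P(x|c). The four toy documents are invented.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    vocab = {t for doc in docs for t in doc}
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    class_counts, word_counts, vocab = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, count in class_counts.items():
        total = sum(word_counts[c].values())
        score = math.log(count / n)  # log prior, log P(c)
        for t in doc:
            if t in vocab:  # ignore words never seen in training
                # Smoothed log likelihood, log P(t | c).
                score += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


docs = [["fever", "cough", "sputum"], ["cough", "night", "sweat"],
        ["lymph", "node", "hodgkin"], ["node", "biopsy", "hodgkin"]]
labels = ["TB", "TB", "lymphoma", "lymphoma"]
model = train_nb(docs, labels)
print(predict_nb(model, ["cough", "fever"]), predict_nb(model, ["biopsy", "node"]))
```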

d. Support vector machines; using a dataset of n features, a Support Vector Machine (SVM) attempts to 
find a decision boundary which maximises the margin between two observed classes [30]. This makes it 
a robust choice for binary classification. In the simplest case, SVMs must come up with a linear 
classifier of the form [31]: 


$$f(x) = w^{\top} x + b \tag{5}$$


One method of determining the input weights is the perceptron algorithm described above. 
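
As a small numerical illustration of (5), the sketch below evaluates a linear decision function for a hand-picked, hypothetical weight vector w and bias b (not values learned in this study) and assigns the class from the sign of f(x). It also computes the margin width 2/||w||, which applies when w and b are scaled so that the closest points of each class satisfy |f(x)| = 1.

```python
import math


def decision_function(w, b, x):
    """f(x) = w^T x + b for a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the hyperplanes f(x) = +1 and f(x) = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters chosen by hand for illustration.
w, b = (1.0, 1.0), -3.0

for x in [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]:
    f = decision_function(w, b, x)
    print(x, f, "+1" if f > 0 else "-1")
print("margin width:", margin_width(w))
```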

e. k-Nearest Neighbours; this method classifies a new instance by finding the k most similar instances in 
an existing dataset. The similarity is determined using metrics such as Euclidean distance or 
Mahalanobis distance [32]. With two feature vectors A=(x_1, x_2, ..., x_m) and B=(y_1, y_2, ..., y_m), representing
two data points with m features, the Euclidean distance is calculated as:

$$\mathrm{distance}(A, B) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \tag{6}$$
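
The classification rule can be sketched directly from the distance formula. The example below is a minimal illustration with invented two-dimensional points and k = 3 chosen for the toy data; it is not the scikit-learn implementation used in the study.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["TB", "TB", "TB", "lymphoma", "lymphoma", "lymphoma"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```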





We evaluated the performance of each algorithm using classification accuracy, precision and recall.
Accuracy is the proportion of all instances that were correctly classified. Precision is the proportion
of true positives among all instances classified as positive. Finally, recall is the proportion of positive
instances that were correctly classified.
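
The three metrics can be computed from a list of true and predicted labels as in the following sketch. The choice of the positive class and the toy labels ("L" for lymphoma, "T" for TB) are assumptions made for illustration only.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall) with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


y_true = ["L", "L", "L", "L", "T", "T", "T", "T"]
y_pred = ["L", "L", "T", "T", "L", "T", "T", "T"]
accuracy, precision, recall = classification_metrics(y_true, y_pred, positive="L")
print(accuracy, precision, recall)
```

Here two of the four lymphoma reports are recovered (recall 0.5), two of the three reports predicted as lymphoma are correct (precision 2/3), and five of the eight predictions are right overall (accuracy 0.625).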

For each evaluation metric above, the performance of each algorithm was estimated using
cross-validation. The dataset was randomly split into 5 subsets, then each algorithm was run 5 times, with
4 subsets used for training and one used for testing in each run.
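
The splitting procedure can be sketched as follows. The helper names, the fixed seed and the majority-class baseline used as a stand-in classifier are hypothetical; the sketch only shows how the five train/test partitions are formed and how the per-fold scores are averaged.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def cross_validate(X, y, fit_predict, k=5, seed=0):
    """Mean accuracy over k folds; fit_predict(train_X, train_y, test_X) -> labels."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        scores.append(sum(p == y[i] for p, i in zip(preds, test)) / len(test))
    return sum(scores) / len(scores)


def majority_baseline(train_X, train_y, test_X):
    """Stand-in classifier: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)


X = list(range(10))
y = ["lymphoma"] * 7 + ["TB"] * 3
splits = list(k_fold_indices(len(X)))
mean_accuracy = cross_validate(X, y, majority_baseline)
print(mean_accuracy)
```
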
3. RESULTS AND DISCUSSION

The search terms submitted to the ScienceDirect API provided 6080 and 4034 articles for 
tuberculosis and lymphoma, respectively. After automatic title review, 546 TB and 765 lymphoma case
reports were kept for our study. Figure 3 gives us a quick preview of some features obtained using TF-IDF.


old   girl   admitted   pediatric   disease   department   ...   difficile   bristol   ewing   german   hypermetabolism   waldeyers   exceptionally
0.0   0.0    0.000000   0.024603    0.0       0.000000     ...   0.0         0.0       0.0     0.0      0.0               0.0         0.0
0.0   0.0    0.042663   0.158406    0.0       0.180864     ...   0.0         0.0       0.0     0.0      0.0               0.0         0.0
0.0   0.0    0.000000   0.000000    0.0       0.000000     ...   0.0         0.0       0.0     0.0      0.0               0.0         0.0
0.0   0.0    0.000000   0.111094    0.0       0.000000     ...   0.0         0.0       0.0     0.0      0.0               0.0         0.0
0.0   0.0    0.000000   0.000000    0.0       0.000000     ...   0.0         0.0       0.0     0.0      0.0               0.0         0.0


Figure 3. Text features screenshot 


Looking at Figure 4, we see that the highest average silhouette score occurs when k=3. Considering three (3)
clusters is therefore optimal in this case, since it minimises similarities between different clusters while
maximising similarities within each cluster. This means that case reports are less likely to be assigned to
the wrong cluster.


Figure 4. Silhouette analysis (silhouette score against the number of clusters, from 2 to 10)


3.1. Cluster analysis 

Figure 5 shows a word cloud for each cluster, which helps visualise the most important words per
cluster. The most frequent words in Cluster 1, such as “hodgkin” and “cell”, suggest that this cluster mainly
contains lymphoma case reports. Examples of lymphoma cases that were assigned to this cluster include
those reported by [33]-[35]. TB cases were mostly allocated to Cluster 2. These cases include those reported
by [36]-[38]. However, this cluster also contained articles discussing tuberculosis that were not case reports
but had not been excluded during title review [39], [40]. The documents in Cluster 0 were neither tuberculosis nor
lymphoma case reports. After analysis, it was found that this cluster consisted of many cases of diseases
wrongly diagnosed as TB, as reported by [41], [42]. The cluster also contained cases where the patient had another
disease on top of tuberculosis or lymphoma.

Analysing the age of patients in the different clusters revealed that lymphoma patients were on
average older than TB patients, with respective mean ages of about 53 and 40 years, as shown in
Figure 6. This is consistent with previous findings indicating that lymphoma cases tend to occur in older
patients [43], [44]. We also noticed that lymphoma cases had a higher proportion of reported male patients.
After pre-processing and vectorising the text, we obtained 7088 features to be fed into the machine learning
algorithms. Table 1 shows the average cross-validation performance of each algorithm in terms of accuracy,
recall and precision.

Figure 5. Clusters’ word clouds, (a) cluster 0: others, (b) cluster 1: lymphoma, (c) cluster 2: TB

Figure 6. Patients’ overview, (a) age distribution, (b) gender distribution


Table 1. Algorithms’ performances

Algorithm             Accuracy   Precision   Recall
Logistic Regression   86.6%      88.1%       87.4%
kNN                   70.5%      76.7%       66.6%
Decision Trees        92.3%      91%         93%
Naive Bayes           86.5%      83.7%       88.7%
SVM                   45.6%      15.2%       33.3%
Perceptron            93.1%      91.9%       93.7%


Performance evaluation of the various algorithms showed that the Multi-Layer Perceptron algorithm 
best identified the correct class of case reports (with 93.1% accuracy). This method also achieved the highest 
recall score (94.1%) and the highest positive predictive value, with a precision score of 95.4%. It therefore 
minimised the possibility of misclassifying a case report and maximised the number of documents from a
given class that were identified correctly.
These results show that machine learning algorithms can differentiate between TB and lymphoma case reports
with high accuracy. Given that the reported results are cross-validation scores, it is likely that the trained
model will perform well on unseen case reports. If implemented to classify case reports, it can help feed the
right data into a diagnosis or referral support system. Such a system can be used to screen patients and detect
lymphoma cases earlier, potentially improving the patients’ prognosis. This could be extremely useful in
diagnosing cancer in people with HIV-related lymphoma, who tend to show non-specific symptoms [45]. It is
important to note that SVM performed very poorly, most likely because the algorithm’s default
sklearn parameters were used. Future research will therefore look into tuning the algorithms and selecting their
optimal parameters.


4. CONCLUSION 

Since tuberculosis symptoms are shared by many other diseases, there is a high probability of
misdiagnosis, especially in areas with restricted resources. Although various machine learning diagnosis
systems exist, this study focuses on collecting and exploring data for a system dedicated to differentiating
between tuberculosis and lymphoma.

As a starting point, the study used web scraping to collect available TB and lymphoma case reports,
then used unsupervised methods to explore them. Case reports were assigned to one of three clusters:
lymphoma, TB and “others”. The results obtained after applying various classification algorithms to the
dataset showed that the MLP model outperformed the other algorithms in accuracy, recall and
precision, making it the most likely to classify a case report correctly. This provides us with a tool for
collecting additional case reports from different sources while ensuring the quality of the collected data.

Future research will aim to improve the MLP and decision tree models by tuning their
hyper-parameters. We will also compare the performance obtained when using stemming instead of
lemmatization during pre-processing, since words like “abdomen” and “abdominal” are still seen as different
concepts using the latter method. We will further collect case reports for the extraction of semantic features,
such as patient symptoms. The resulting feature space will then be used to train a TB/lymphoma screening
support system.